BAYESIAN MODEL CHOICE AND INFORMATION CRITERIA 
IN SPARSE GENERALIZED LINEAR MODELS 



RINA FOYGEL AND MATHIAS DRTON 



Abstract. We consider Bayesian model selection in generalized linear models that are high-dimensional, 
with the number of covariates p being large relative to the sample size n, but sparse in that the number of 
active covariates is small compared to p. Treating the covariates as random and adopting an asymptotic 
scenario in which p increases with n, we show that Bayesian model selection using certain priors on the 
set of models is asymptotically equivalent to selecting a model using an extended Bayesian information 
criterion. Moreover, we prove that the smallest true model is selected by either of these methods with 
probability tending to one. Having addressed random covariates, we are also able to give a consistency 
^ result for pseudo-likelihood approaches to high-dimensional sparse graphical modeling. Experiments on real 

^ data demonstrate good performance of the extended Bayesian information criterion for regression and for 

graphical models. 
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d model's parameter space. The two classical criteria are Akaike's information criterion (AIC) (Akaike 19741, 



1. Introduction 

Information criteria provide a principled approach to a wide variety of model selection problems. The 
criteria strike a balance between the fit of a parametric statistical model, measured by the maximized 
likelihood function, and its complexity, measured by a penalty term that involves the dimension of the 



which targets good predictive performance, and the Bayesian information criterion (BIG) introduced in 



'— ' Schwarz (1978), which is motivated by a connection to fully Bayesian approaches to model determination 

^ and which has been proven to enjoy consistency properties in a number of settings. If we call a model 

^ "true" if it contains the underlying data-generating distribution, then consistency refers to selection of the 

smallest true model in a suitable large-sample limit. In this paper we will be concerned with the BIG for 
generalized linear models with random covariates in a sparse high-dimensional setting, where the number of 
covariates is large but only a small fraction of the covariates is related to the response. Our main results 
show that extensions of the BIG are consistent, and are accurate approximations of Bayesian procedures, 
in asymptotic scenarios where the number of covariates grows with the sample size. The results can be 
] interpreted either as giving a precise Bayesian motivation for recently-proposed information criteria, or 

, as proving that Bayesian procedures enjoy favorable frequentist properties. In particular, our work gives 

• • uniform error bounds for Laplace approximations to the large number of marginal likelihood integrals arising 

^ in a Bayesian treatment of sparse high-dimensional generalized linear models, and results in consistent model 

selection for high-dimensional graphical models with binary variables (the Ising model). 

^ 1.1. Classical theory. Suppose we observe a sample of n observations for which we consider the parametric 

model A4 with log- likelihood function £[„](^). Then, written in the most commonly encountered form, the 
BIG for this model is 

BIG(7W) = -2£[n]{0M) + dim{M) -login) , 
with a lower value being desirable. Here dim(A^) is the dimension of the model's parameter space Om, £^nd 

9m = arg max £[„] (6) 

is the maximum likelihood estimator (MLE) in model Ai. The classical large-sample theory underlying the 
BIG considers a finite family of competing models that is closed under intersection, and for which strict 
inclusion implies strictly lower dimension. In order to prove consistency of the BIG, it then suffices to 
make pairwise comparisons between models Mi and M2 showing that, with asymptotic probability one. 
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BIC(A^i) < BIC(7W2) if either (i) Mi is true and M2 is not, or (ii) both Mi and M2 are true but 
dim(A^i) < dim(A^2)- In case (i), a proof shows that the difference in the hkeUhood terms in the BIC 
outgrows the logarithmic penalty term as n — ?> 00, whereas in case (ii) the logarithmic term outgrows 
the difference in log-likelihood values, which remains bounded in probability; compare, for instance, Nishii 



(1984), Haughton 



Anderson (2002) 



( 1988 ) or monographs on model selection and information criteria such as Burnham and 



Claeskens and Hjort (2008); Konishi and Kitagawa (2008) 



The penalty term appearing in the BIC is only one of many possible choices to balance model fit and 
complexity in a way that leads to consistency. However, the logarithmic dependence on the sample size 
makes a connection to Bayesian approaches. Consider the prior /ai(^) for the parameter 9 in model M, and 
write P{M) for the prior probability of M. Then the posterior probability of M is proportional to 

(1) 



P{M)- / eMe[n]iO)} fMiO) d9, 
Je 

where the integral is commonly referred to as the marginal likelihood. In well-behaved models, for large n, 
the integrand in the marginal likelihood takes large values only in a neighborhood of the MLE 9m- Moreover, 
in such a neighborhood the log-likelihood i'[„] (9) can be approximated by a quadratic function, while the 
prior fM{()) is approximately constant. Evaluating the resulting Gaussian integral reveals that the logarithm 
of the marginal likelihood equals — V2 • BIC(A^) plus a remainder term that is bounded in probability when 
the n observations are drawn from a distribution in M and the sample size n tends to 00. The remainder 
term can be estimated to be 



1 



dim(7W)log(27r) +log/. 



1 
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where i?r„] is the Hessian of the negated log- likelihood function (and scales with n) . The work of 



Haughton 
likelihood 



( 1988 ) provides a rigorous probabilistic treatment of this Laplace approximation to the marginal 
in the general setting of smooth (or curved) exponential families. Considering a finite family of models 
suffices to treat one model at a time in this analysis. 



it 



1.2. Recent extensions of the BIC. In the last decade, new applications of the BIC have emerged in 
problems of selecting sparse models for high-dimensional data, including problems such as tuning parameter 
selection in Lasso and related regularization procedures; see e.g. Zou et al. (2007). Following a proposal 
from Bogdan et al. (20041), the work of Chen and Chen (2008) treats an extended BIC for variable selection 



in sparse high-dimensional linear regression with deterministic covariates. The extension allows for a more 
stringent penalty to address the (ordinary) BIC's tendency to select overly large models in this setting. The 
asymptotic scenario underlying the theoretical analysis of the new criterion allows for subexponential growth 
in the number of covariates p as a function of the sample size n, with a bound on the number of covariates 
that appear in the true mean function. The main result of Chen and Chen ( 2008 ) shows variable selection 



consistency under these asymptotics. Chen and Chen (2011 1 extend the results to generalized linear models 



(GLMs), and further improvements for linear regression are given in Luo and Chen (2011a) and in Zhang 



and Shen (2010). Consistency in Gaussian graphical models has been studied by Foygel and Drton (2010) 



and Gao et al. (2011). Composite likelihood-based criteria are treated by Gao and Song (2010). The main 



difficulty in showing consistency in these high-dimensional settings is the need to control a diverging number 
of models. We remark that a related study of the ordinary BIC that focuses on pairwise model comparisons 



is given by Moreno et al. (2010) 



In the literature on the extended BIC, the key idea for treating the high-dimensional setting is to augment 
the BIC with an informative prior on models. In the regression setting, which is our focus in this paper, the 
competing models correspond to different subsets of covariates. If p denotes the number of covariates, then 
a model corresponds to a subset J C [p] := {1, . . . ,p}, and the extended BIC is based on the prior 

(2) P{J) ' 



p+l 



that gives equal probability to each model size, and to each model given the size. Clearly, this priors favors 
an individual small model over an individual model of moderate size. More generally, priors of the form 



(3) 



P(J) oc 



^{\J\<q} 
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with a hyperparameter 7 G [0, 1] have been considered to ahow one to interpolate between the classical BIC 
of Schwarz (1978), obtained for 7 = 0, and the prior in ^ given by 7 = 1, while at the same time invoking 
an upper bound q on the number of covariates. Some of our later asymptotic results suggest that it can be 
useful to consider 7 > 1 if the number of covariates p is very large. Priors of this form have also been shown 



to be useful in fully Bayesian approaches to high-dimensional regression; compare Scott and Berger (2010) 



who motivate this and related priors in a construction that includes each covariate with a fixed probability 
that itself is given a Beta prior. 

Writing J for a particular subset of the available covariates, inclusion of the prior from ([s]) into the 
information criterion yields the extended BIC (EBIC) 

BIC^(J) = ~2£[n]{0j) + \J\- log(n) + 27I J| • log(p) . 

Under suitable conditions on the design matrix (ensuring, for instance, that removing a covariate from the 
smallest true model will substantially lower the achievable likelihood), Chen and Chen (2008 20111) show 



consistency of the EBIC in both the normal linear regression and the univariate GLM setting with fixed (i.e., 
non-random) covariates. Consistency holds as long as 7 > 1 — where k determines the rate of growth of 
the number of covariates, and thus the space of possible models, with p = 0{n'^). 

While our focus is entirely on consistency properties of model selection procedures for high-dimensional 
regression, we should mention that other properties are of interest and have been studied. For instance, |Shao| 
(1997) considers a similar problem, proving results about selection of the best predictive model, rather than 



consistency in variable selection. Jiang (2007) points out that in applied settings, the question of variable 



selection is not always well-defined due to many coefficients that are approximately rather than exactly zero, 
and considers the problem of estimating the true parameter vector under the assumption of approximate 
sparsity. This paper uses a prior on models similar to the one in ([2|. 

1.3. Recent work on Bayesian model selection in high-dimensional settings. Several recent papers 
have examined the properties of Bayesian model selection in scenarios where the model size may be large 
relative to the sample size. For instance, a consistency result for Bayesian linear regression has recently 
been obtained by Shang and Clayton (2011). This work focuses on a specific Bayesian model that assumes 



the regression coefficients to follow a particular prior distribution that is a mixture of a normal distribution 
and a point mass at zero. There has also been work on the problem of choosing between a pair of models. 



including a recent article by Kundu and Dunson (2011 ) that shows consistency of Bayesian pairwise model 



comparison under a flexible, non-parametric noise model. These results are not directly comparable to the 
problem discussed here, where we consider the problem of searching for the smallest true model from among 
a combinatorially large set of possible sparse models in a generic regression setting. 

1.4. New results. With a penalty term reflecting a particular type of prior on models, the EBIC has a clear 
Bayesian motivation. However, it is not immediately clear that the EBIC and fully Bayesian approaches 
using the same prior on models should lead to asymptotically equivalent model choice in a high-dimensional 
asymptotic scenario that has the number of covariates p grow with the sample size n. Our first main result. 
Theorem [T| addresses this issue for generalized linear models and shows that such equivalence indeed occurs 
at a fairly general level. More precisely, our result shows that a Laplace approximation to the marginal 
likelihood of each one of a growing number of models results into errors that are, with high probability, 
uniformly bounded as O ( •\/log(np)/n) . 

Our second main result, Theorem [2) pr ovides a consistency result for the EBIC. The result is very closely 
related to those of Chen and Chen| ( |2011[ ) . The primary difference is that we consider random rather than 
deterministic covariates and allow for unbounded covariates, subject to a moment condition, in some special 
cases such as logistic regression. Consistency is proven under the same conditions that we use to obtain the 
equivalence of the EBIC and fully Bayesian model selection. Combining the two Theorems yields CoroUaryjl] 
which states consistency for fully Bayesian model determination. 

Theorem [2] also allows us to obtain consistency results for pseudo- likelihood methods in graphical model 
selection, where regressions are performed to model each node's dependence on the other nodes in the graph 
("neighborhood selection" for each node), and these neighborhoods are then combined to hypothesize a 
sparse graphical model; see Meinshausen and Biihlmann (2006) and Ravikumar et al. (2010). Since in 



each regression, the covariates consist of random observations from the potential "neighbors" of the node 
in question, it is crucial that our analysis of consistency allows for random covariates. Furthermore, for 
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consistent model selection, the neighborhood selection procedure must succeed simultaneously for each node, 
and therefore we make use of the explicitly-calculated finite-sample bounds in Theorem [2] Our results for 
the graphical Ising model are given in Theorem |4] 

The remainder of this paper is organized as follows. We begin by defining the setting and introducing 
notation, in Section [2j In Section [3j we discuss Bayesian model selection and the Laplace approximation to 
the marginal likelihood. Our result on the consistency of the EBIC for regression is given in Section [4] where 
we also present experiments on real data that show good performance of the EBIC in practice. In Section 
[5j we turn to graphical models and present theoretical and empirical results showing the consistency of the 
EBIC for reconstructing a sparse graph based on a neighborhood selection procedure. We outline the proofs 
for Theorems [T] , [2] and |4] in Section |6j and give the full proofs in the Appendix. Finally, in Section [7j we 
discuss our results and outline directions for future work. 

2. Setup and assumptions 

We treat generalized linear models in which the observations of the response variable follow a distribution 
from a univariate exponential family with densities 

(X exp{y0-b(0)} , dee = R, 

where the density is defined with respect to some measure on M. More precisely, the observations of the 
response are independent random variables Yi , . . . , F„ , with Yi ~ pe^ . The vector of natural parameters 
6 = {9i, . . . , 9n)^ is assumed to lie in the linear space spanned by the columns of a design matrix X = 
(Xij) e M"^P, that is, 6 ~ Xcf) for some parameter vector (j) G W. Our focus is on the case of random 
covariates. Let X.^ be the n-dimensional vector of observed values for the jth covariate (the jth column 
of the design matrix X), and write Xi, for the p-dimensional covariate vector in the ith row of the design 
matrix X. Then we assume Xi,, . . . , X„. to be independent and identically distributed random vectors. 

Our results treat a sparsity scenario in which the joint distribution of Yi, . . . , y„ is determined by a true 
parameter vector (j>* e W supported on a (small) set J* C [p], that is, (p* if and only if j e J* . Our 
interest is in recovery of the set J* . To this end, we consider the different submodels given by the linear 
spaces spanned by subsets J C [p] of the columns of the design matrix X. We will use J to denote either an 
index set for covariates or the resulting model, for convenience. Finally, we denote subsets of covariates by 
Xij = (Xy)jgj, where J C [p] {1, . . . 

2.1. Assumptions. We will be concerned with asymptotic questions in a scenario in which n — > oo and 
the number of covariates p = p„ is allowed to grow. Let k„ = log„(p„), and let k — limsupK„ G [0, cx)]. 
Subsequently, we will suppress the sample size index and write p rather than p„. Our theorems apply to 
either one of the following cases (recall that Xij, . . . , Xnj are identically distributed): 

(Al) The covariates are bounded (or bounded with probability one), that is, there is a constant A < oo 

such that, \Xij\ < A for j = 1, . . . ,p. 
(A2) There is an even integer K > 2k (in particular, k < oo), for which the covariates have moments 

bounded as E [jXijI^^] < Ak < oo for j — 1, . . . ,p. Moreover, for all i > 0, it holds that 



Bif (i) := supExi 
j 



sup \h'"{e)\ 

W\<t\Xi,\ 



< oo 



We remark that the condition on the third derivative of the cumulant generating function b holds for logistic 
and Poisson models, as well as the normal model with known variance. In addition, we assume the following 
five conditions: 

(Bl) The growth of p is subexponential, that is, log(p) = o(n). 

(B2) The size of the true model given by the cardinality of the support J* of the true parameter vector 

(j)* is bounded as | J*| < g for a fixed integer g G N. 
(B3) All small sets of covariates have second moment matrices with bounded eigenvalues, that is, for some 

fixed finite constants ai,a2 > 0, it holds that oilj ^ E [X^^jXjj] ^ a2lj for all |J| < 2q. 
(B4) The norm of the true signal is bounded, namely, ||^f'*||2 < 03 for a fixed constant < 03 < 00. 
(B5) The small true coefficients have bounded decay such that 



log(np) . . J. . . 
= o (mm I |0j I : J e J | j 



MODEL SELECTION IN SPARSE GENERALIZED LINEAR MODELS 



5 



2.2. Comparison to assumptions used in existing work. We compare the above assumptions to those 
used by Chen and Chen ( |2011 ), who show that the EBIC is consistent for univariate GLMs with fixed covari- 



ates. In this work by Chen and Chen, only bounded covariates are considered — that is, our (Al) scenario. 
Our conditions (Bl), (B2), and (B4) appear (explicitly or implicitly) in their work as well. Condition (B5) 
appears in a stronger form in their work, where they assume that (f>* is fixed and therefore its minimal 
nonzero value is bounded from below by some constant. 



A crucial difference lies in our assumption (B3). The analogous condition of Chen and Chen (2011) 
requires that, for some positive finite Ai and A2, Ailj ^ n~^Hj[(j)j) < A2IJ, for any J D J* with \ J\ < 2q 
and any in a neighborhood of <f>*j. Here Hj{-) := (iJ[„](-)) is the Hessian of the negative log-likelihood 
function (restricted to rows and columns corresponding to the covariates in J). Note that Hj{-) depends on 
the design matrix Xj for the given set of covariates J. Since we work in the setting of random covariates, 
we cannot make this assumption on the empirical design matrix, and therefore use condition (B3), which is 
a weaker assumption on the distribution of the covariates. 

3. BAYESIAN model SELECTION 

Observing the independent random vectors (Xi., Yi), . . . , (X„., Y^), the likelihood function of the consid- 
ered GLM is 

L[„](<^) = exp{^[„](0)} =exp|f^^,(<^)| -exp|fjy,.X,:^<^-b(X,:^(/))| , 

with £i{(j)) being the log-likelihood function based on the ith observation {Xi,,Yi). Let P{J) be the prior 
probability of model J C [p], and let fj{(f>j) be a prior density on the model's parameter space M'^. The 
unnormalized posterior probability of model J is then 



Bayes( J) = P(J) • / L[„ 



where, with some abuse of notation, we write L[„]((/)j) for the likelihood function of the model given by 
J C [p], that is, 

L[„](0j) = exp |^^^(0j)| = exp jf^K, • Xfjcj^j - b(Xf>j)| . 

Our interest is now in the frequentist properties of the Bayesian model selection procedure that chooses a 
model J by maximizing the (unnormalized) posterior probability Bayes(J). Assuming that observations are 
drawn from a distribution in the GLM, we ask the following two questions. First, is the Bayesian procedure 
consistent, that is, will it choose the smallest true model in the large-sample limit? Second, how can we 
approximate the marginal likelihood integral appearing in Bayes(J), without introducing approximation 
errors that might change which model is selected? In the classical scenario with a fixed number of covariates 
p, when considering a growing sample size n, the answers to the above questions are tied together. Under 
suitable conditions, consistency of the Bayesian procedure can be established by proving that, for large 
samples, it selects the same model as the consistent BIC or the more accurate approximation obtained by 
applying the Laplace approximation to the marginal likelihood integral. We will show these same connections 
to exist in sparse high-dimensional settings. 

Theorem [1] below states that, under appropriate conditions, the Laplace approximation to marginal like- 
lihood integrals remains uniformly accurate across large spaces of models. This result is obtained under an 



upper bound q on the model size. As discussed in Section 1.2 in the introduction, we give special emphasis 



to a particular class of prior distributions on the set of models, namely, priors of the form 



1 7 
p ^ 



(4) P(J)(x(^|-j •1{|J|<9}, Jc[p], 

for some 7 > 0. We write Bayes^(J) for the unnormalized posterior probability associated with the choice of 
prior P{J) — ■ 1 {| J| < q], where we suppress the normalizing constant in the prior for convenience. 

Then we show that, for sufficiently large n, the event 

arg min BICt,( J) = arg max Bayes^( J) . 

\J\<q \J\<q ^ 
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occurs with high probability. In other words, the EBIC 

(5) BIC^(J) = -21ogL[„](^j) + |J|log(n)+27|J|log(p) . 

yields an approximation to the Bayesian posterior probability that is accurate enough for the resulting model 
selection procedures to be asymptotically equivalent. In fact, Theorem [T] states a stronger result according 
to which Bayes^(J) is approximated up to a constant by BlCj{J). Finally, we prove consistency of the EBIC 
in Section [4] In combination with the results of this section, we obtain a proof of the consistency of the 
Bayesian model selection procedure. 

We now give the precise statement of the points just outlined. We adopt the notation a = 5(1 ± c) to 
conveniently express that a belongs to the interval [b{l — c), b{l + c)]. 

Theorem 1. Assume that conditions (B1)-(B5) hold, and that either assumption (Al) or (A2) holds. 
Moreover, assume the following mild conditions on the family of priors {fj : J C [p], | J| < g), which require 
the existence of constants < Fi, F2, < 00 such that, uniformly for all \J\ < q, we have 

(i) an upper bound on the priors: 

s^P^jfj{<l>j) < Fi <oo, 

(ii) a lower bound on the priors over a compact set: 

Il0j||2<i?+i/j('^-/) > ^2 > 0, 
where R is a function of the constants in assumptions (Al) or (A2) and (B1)-(B5), defined in the 
proofs, 

(iii) a Lipschitz property on the same compact set: 

SUP||0j||2<H+i W"^ f. ]{(}). j)\\2 <F3<00. 

-I 1/2 3/2 

Then there is a constant C , no larger than 4F3F2 Xi + 2qX3X^ + 2, such that, for sufficiently large n, 
the event that 

Bayes(J) = P(J) • Li^^iMfAM ■ \Hj{h)['^^ (2^)'"/^ • (^l ± C^j^^^^ for all \J\ < q 

occurs with probability at least l — {np)^^ under (Al), and with probability at least l — (np)^^—4:K^^^n 2 
under (A2). In particular, for the (unnormalized) prior 

^(^) = (|J|) • 1 ^ '^oZds that 

|log(Bayes^(J)) - (-iBIC^(J))| < Ci , 

where Ci is a constant no larger than | log(27r) + jq\og{2q) + glogmaxjAj""'^, A2} + logmaxji^i, i^2~^} + ^■ 

The proof of this theorem is given in Section [6j The constant R appearing in conditions (ii) and (iii) on 
the family of priors arises in the proof, where we show that with high probability, the MLEs (pj for all sparse 
models J will lie inside a ball of radius R centered at zero. 

4. Consistency of the extended Bayesian information criterion 

Let J* be the smallest true model; recall Section [2] We now show that the extended BIC from ([s]) 
satisfies that, with high probability, BlC-y(J) > BIC^(J*) for all J 7^ J* with |J| < q, as long as the 
penalty on model complexity is sufficiently large. Specifically, we require 7 > 1 ~ 577 (and 7 > 0), where 
K = limsupKn = limsuplog„(p„) S [0,oo]. 

Our main consistency result, stated next, is very similar to the consistency results of |Chen and Chen| 
(2011) but treats random instead of deterministic covariates. 

Theorem 2. Assume that conditions (B1)-(B5) hold, and that either assumption (Al) or (A2) holds. Choose 
three scalars a,/?, 7 to satisfy 

\ a e (0, i) and P > Q, if k = 0. 
Then, for sufficiently large n, the event 

(6) BIC,(J*) < (^^ min ^^BIC,(,/)) - log(p) • (7 - (l - ^ + ^ + 
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occurs with probability at least 1 — n ^ under (Al) or at least 1 — 4K^^^n 2 —ji°'p0 under (A2). 
In particular, the EBIC is consistent for model selection, whenever 7 > 1 ^ 577 • 

Combining Theorem [2] with Theorem [T] which showed the equivalence of EBIC-based and Bayesian model 
selection, we obtain the following corollary. 

Corollary 1. Assume that conditions (B1)-(B5) hold, and that either assumption (Al) or (A2) holds. 
Choose three scalars a, /3, 7 to satisfy 




Then, for sufficiently large n, with probability at least l~n^°'p^^ under (Al) or at least 1^ iK^^^n 2 — 
n^°'p~^ under (A2), 

Bayes^fJ*) > min Bayes^(J) . 

J^J',\J\<q 

In particular, Bayesian model selection is consistent, whenever 7 > 1 ~ 5^ • 

4.1. Selecting from a set of candidate models. In practice, it is not computationally feasible to calculate 
either the Bayesian marginal likelihood or the EBIC for every possible sparse model, since even if the model 
size bound q is relatively small, the number of possible models is very large, on the order of p"^ . Furthermore, 
the size q of the smallest true model is not known in general. Typically, the BIC (or another selection 
criterion) is applied only to a manageable number of candidate models, obtained via some other method. 



In the sparse regression setting, the Lasso (Tibshirani 1996) has been demonstrated to be very effective at 



recovering sparse linear and generalized linear models (Friedman et al. 2010). The Lasso selects and fits a 
model by solving the convex optimization problem 

(7) r = arg rnm { - ^£,(0) + p||<^||i} , 



where = J2j I'f'jl the vector 1-norm and p > is a penalty parameter. For an appropriate choice of 
p and under some conditions on the covariates and the signal, the Lasso is known to be consistent for linear 



regression; compare Chapter 6 of Biihlmann and van de Geer (2011). The optimal choice of p suggested 



by theory depends on unknown properties of the distribution of the data, and is therefore unknown in an 
applied setting. A common approach to the problem is to fit the entire "Lasso path" of coefficient vectors (pP 
for p in the range [0, 00), thus producing a list of candidate sparse models { Ji, J2, . . . }, and then to select a 
model from this list using a technique such as cross-validation or the BIC. By Theorem [2] with probability 
near one (for large n), BIC^(J*) < BlC-y(Jm) for any sparse model Jm 7^ J* in the candidate set. Therefore, 
if the smallest true model J* is in the candidate set, we will be able to find it with high probability by 
applying the EBIC to every candidate model. 

4.2. Experiment for sparse logistic regression. We compare the BIC, the extended BIC with 7 = 0.25 
and 7 = 0.5, and 10-fold cross-validation on the task of selecting a logistic model for distinguishing between 
spam and legitimate emails. We compare also to stability selection ( jMeinshausen and Biihlmann, ,2010^ , 
a recent alternative approach to the problem of sparse model selection that applies the Lasso repeatedly 
to subsamples of the data, and then chooses covariates to include in the model based on whether they are 
"stable" , that is, whether they appear consistently over the repeated samples. 

4.2.1. Data and meth ods for model selection. We used the Spambase data set from the UCI Machine Learn- 
ing Data Repository (Frank and Asuncion 2010[ )p] The data is drawn from 4,601 emails, and consists of a 



binary response (spam or non-spam classification), along with predictors measuring the frequency of certain 
words and characters in the email, and several other predictive features, for a total of 57 real-valued covari- 
ates. To create a challenging setting where the number of covariates is large relative to the sample size, we 
first randomly sampled a subset 5* C {1, . . . ,4601}, for various sample sizes n = \S\. We then created fake 
covariates by permuting the true features, in order to allow p to grow with n. We ran 100 repetitions of each 
of the settings shown in Table [T] 



^Available at http://archive.ics.uci.edu/ml/datasets/Spaiiibase 
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Table 1. Settings for the spam email experiment. 



n 


P 


=//= 11 uc 

features 


^ permuted 
features 


100 


57-4 


57 


57-3 


200 


57-8 


57 


57-7 


300 


57-12 


57 


57-11 


400 


57-16 


57 


57-15 


500 


57-20 


57 


57-19 


600 


57-24 


57 


57-23 



Table 2. Positive selection rate and false discovery rate in the spam email experiment. 





n — 
PSR 


100 
FDR 


n = 
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300 
FDR 


n — 
PSR 


400 
FDR 


n — 
PSR 
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FDR 


n = 
PSR 


600 
FDR 


BICo.o 


14.12 


10.95 


19.37 


18.04 


23.39 


20.04 


26.40 


20.79 


30.79 


22.86 


33.46 


19.81 


BICo.25 


8.65 


2.18 


11.33 


0.92 


15.11 


1.49 


17.42 


1.00 


20.21 


3.03 


22.35 


2.75 


BICo.5 


6.37 


0.27 


8.82 


0.00 


11.00 


0.00 


13.33 


0.00 


14.77 


0.24 


16.60 


0.00 


Cross- val. 


6.89 


23.54 


13.67 


36.67 


19.68 


46.67 


30.30 


50.80 


37.44 


56.32 


38.16 


59.48 


Stability sel. 


3.11 


0.56 


6.56 


2.35 


8.56 


2.01 


10.96 


2.95 


12.05 


4.05 


13.65 


4.07 



Let Yi be the class label with = 1 if email i is spam, and li = otherwise. Let Xij be the value of 
the jth covariate for the zth email. For each {n,p) pair, we performed the following steps. We first drew 
a subsample S = {ii, . . . , i„} C {1, . . . , 4601} uniformly at random, and define the response vector to be 
(Kij , . • ■ , We then randomly chose permutations <ti, . . . , (Jk of {1, . . . , n}, where K is chosen to obtain 

the desired total number of covariates, i.e. p = 57 - (1 + K). We define the design matrix 



X, 



11,1 



X. 



ii,57 



X. 



X, 



X,„ 



■1(1)' 



X,, 



x. 



1(1) 



,57 



):57 



X, 



X,, 



X,, 



X,, 



■Jfd) 



,57 



.),57 



which contains one block of 57 true features, and K blocks of 57 permuted (fake) features. 

To evaluate the BIC, the EBIC, and cross-validation on this data, we first generated models by applying 
the logistic Lasso with a range of 100 penalty-parameter values to the data, using the glmnet package 
(Friedman et al. 2010) in R (R Development Core Team 2011). This produced a list of 100 (possibly not 



distinct) support sets, Ji, . . . , Jioo- For the BIC and the EBIC, we refitted each candidate model J™ using 
the function glm in R, and applied BIC^ with 7 = 0.0,0.25,0.5 to each candidate model, to select a single 
model for each BIC^. We also applied 10- fold cross-validation, selecting the single model from the list of 
candidate models that minimizes average error on the test sets over the 10 folds. 



Finally, for stability selection, we used the stabsel function in the mboost package (Hothorn et al. 
in R, with expected support set size q = 50. As noted by Meinshausen and Biihlmann 



2009) 



(2010), changing the 



settings within a reasonable range did not have a large effect on the output. 

4.2.2. Results. We evaluate the methods based on their ability to distinguish between the 57 true and the 
remaining false (permuted) features. Table[2]and Figure[l]show the positive selection rate (PSR) and the false 
discovery rate (FDR) for each of the five methods in this task, over the range of sample sizes. As customary, 
PSR is defined as the proportion of true features selected by the method, and FDR is the proportion of false 
positives among all features selected by the method. 

Comparing the three BICs to cross-validation, we observe that cross-validation can recover more true 
features (for larger values of n), but at an unacceptably large increase in the FDR. The original BIC 
performs better but still exhibits a high FDR. In contrast, the FDR of the EBIC with either 7 = 0.25 or 
7 = 0.5 remains very low at all sample sizes; the associated PSR is smaller but increasing with the sample 
size. Stability selection performed similarly to the EBIC with 7 — 0.5 in this experiment, but with slightly 
lower PSR and slightly higher FDR. Overall, it seems that the EBIC with 7 = 0.25 performed best at the 
task of identifying the 57 true features, with a very low FDR and a moderately good PSR. 
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Figure 1. Results for the spam email detection experiment. 



The rather low PSRs observed in the simulations are due in part to the fact that the 57 true features 
are not necessarily all strongly relevant to the response. To account for this in our evaluation of the five 
methods, we ran a logistic regression using the full data set (with a sample size of 4,601 emails) using the 
glm function in R, and extracted the p- values for each feature. For each method, using the models selected 
by the method over 100 repetitions of the experiment with {n,p) = (600, 57- 24), we use Gaussian smoothing 
(scale: standard deviation = 0.1, on the p- value scale) to estimate, as a function of t, the probability that the 
method will select a true feature with p- value t. The estimated functions are plotted in Figure [2] (The rate 
of selection of false (permuted) features is not shown in this figure.) We see that the function estimates for 
cross-validation and the BICs each decay steadily with p-value, which seems desirable. In this experiment, 
stability selection appears to distinguish less clearly between highly and moderately relevant features, if we 
accept the p-values as a reasonable measure of relevance. 



5. Edge selection in sparse graphical models 

In many applications, sparse graphical models are used to analyze data arising from multivariate obser- 
vations with sparse dependency structure. In the setting we treat here, an undirected graph G consists of a 
set of nodes V representing the observed variables, and a set of undirected edges E G V x V representing 
possible conditional dependencies between pairs of nodes. Specifically, if two of the variables do not have an 
edge between their corresponding nodes, then they are conditionally independent given all other observed 
variables. The problem of graphical model selection consists in selecting an appropriate set of edges to 
include in the graph that represents the dependency structure among the observed variables. 



In Section [STTj we introduce different approaches to this edge selection problem. In Sections 5.2 and 5.3 



we discuss existing and new theoretical results for two commonly used classes of sparse graphical models. 



5.1. Sparse graphical models. Suppose we observe n independent and identically distributed random 
vectors in M^, denoted Xi, — {Xn, . . . , Xip), for i = l,...,n. For each graph G on the set of nodes 
V = {1, . . . ,p}, associate the model A4g comprising all distributions for which the conditional independence 
constraints implied by G are satisfied. We are then interested in the recovery of the graph G* that encodes 
the dependency structure in the common true distribution of Xi,, . . . , Xn,. 
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Figure 2. Smoothed probability of selecting a true feature, as a function of the p-value of 
that feature in the full regression. 



Since optimizing over the set of all (sparse) graphs is computationally infeasible, £i-norm penalization 
methods have been considered. These 'graphical Lasso' procedures maximize the sum of the log-likelihood 
function and the absolute values of the relevant interaction parameters. As in the regression problem in ([7|, 
a tuning parameter p is introduced to allow for the necessary trade-off between log-likelihood function and 
penalty term. This approach is the most tractable for the Gaussian case in which the penalty is the sum 



of the absolute values of the off-diagonal entries of the precision matrix O (Banerjee et al. 2008 Friedman 



et al. 



2008). With the .^i-norm promoting sparsity in the estimate Qp, a graph estima te G^ ^ 
obtained by including an edge between nodes j and k whenever 9^,,. ^ 



can be 



Ravikumar et al. 



(2011) show that, 



under eigenvalue and irrepresentability assumptions on the true precision matrix G* 
asymptotically consistent for a suitable sequence of values of p. 



the estimate C^,,^,^ is 



A similar approach is the neighborhood selection method of Meinshausen and Biihlmann ( 2006 ) , which 
performs penalized regression for selecting each node's neighborhood. Specifically, for each variable j, we 
optimize a penalized conditional likelihood function to find 



argmin^^l - ^logP(X,,|{X,fc : k ^ j}, + |li} 



We then define the graph estimate G'^ 



to have an edge between nodes j and k whenever and are 

This method inherits 



, „ _ „_ . _ .^P ov,^ rp 

neighbor 

both nonzero (the and rule), or whenever either I3^i^ or is nonzero (the OR rule) 
asymptotic consistency properties from results for the individual regressions. 

Both of the above methods require choosing the tuning parameter p. Similarly, greedy search over all 
graphs requires a choice of a sparsity bound q, or alternately, a stopping criterion to indicate when enough 
edges have been added. In each case, we can rephrase the tuning problem as the question of selecting a 
model from a small list of candidate graphs Gi, . . . , Gm, of various sparsity levels. 

We can use cross-validation to select a model from this list, but there are two disadvantages. First, K-iold 
cross- validation can be computationally expensive due to the process of fitting models to K different parts of 
the data. More importantly, from the point of view of graph recovery, cross-validation tends to choose overly 
large models leading to selection of many false positive edges, in the high-dimensional setting when p ^ n; 
compare Foygel and Drton (2010). As for regression, we can alternatively use stability selection (Meinshausen 



and Biihlmann 2010), where we search for edges that are stable across sparse models fitted to subsamples 



of the data using graphical Lasso or neighborhood selection; see also Liu et al. (2010). This method has 
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been shown to be asymptotically consistent in a range of settings. However, it again requires refitting the 
model many times for different subsamples. Finally, as a third approach, we may apply information criteria, 
and we now turn to two specific settings where the extended BIC yields a computationally inexpensive and 
asymptotically consistent procedure for edge selection. 

5.2. Gaussian graphical models. Suppose the i.i.d. observations Xi,, . . . , Xp, are multivariate normal 
with precision (or inverse covariance) matrix Q. Then it is well known that Xij and Xik aire conditionally 
independent given the remaining variables {Xi : I ^ j, k} if and only if Ojk = 0. The Gaussian graphical 
model A^G associated with an undirected graph G on nodes V — {1, . . . ,p} is the set of all multivariate 
normal distributions with Qj^ = when j and k are two distinct non-adjacent nodes in G. 



Prior work proposes the use of the extended BIC for sparse Gaussian graphical model selection ( Foygel 
and Drton 2010 Gao et al. 2011). Accounting for a matrix parameter, the EBIC is defined as 



BIC^(G) = -2^[„](eG) + \G\ ■ log(n) + 4|G|7 • log(p), 

where £[„](6g) denotes the maximized log-likelihood function for the set of n observations, and \G\ is the 
number of edges in the graph. Since each model is only fitted once (to the full data set) , this method carries 
relatively low computational cost, while enjoying consistency properties. We now state a version of the main 



theorem from Foygel and Drton (2010), which gives conditions under which minimization of the EBIC leads 



to selection of the smallest true model G* when applied to any list of sparse decomposable graphs containing 
G*; for a definition of decomposable graphs we refer the reader to Lauritzen (19961. 



Theorem 3. Suppose that the true graph G* is decomposable with \G*\ < q, and that the true precision 
matrix Q* € has hounded condition number and minimum nonzero value 9o bounded away from zero. 

Suppose that p (x n'^ for some n < 1, and that the true neighborhood size is bounded for each node. Fix any 
7 > 1 — Then with probability tending to one as n — > oo, 



BIC-y(G*) < min{BIC-y(G) : G is decomposable with \G\ < 9} ■ 



Together with consistency results on the graphical Lasso and on neighborhood selection, this result implies 
that combining EBIC and either graphical Lasso or neighborhood selection gives a consistent method for 
edge selection under the assumptions stated. While our proof of the theorem relies on exact distribution 
theory applicable to decomposable graphs, we conjecture that the stated result holds without the restriction 
to decomposable graphs. 



Gao et al. (2011) propose EBIC-based tuning of the so-called SCAD penalization method for graphical 
model selection and give a consistency result taylored to this method. The version of the EBIC studied by 
these authors has the maximum likelihood estimator replaced by the SCAD estimator, and the model search 
is restricted to a subset of the SCAD regularization path. No decomposability assumptions were needed by 



Gao et aL (2011) 



5.3. Ising models. In the setting of binary observations Xi, 
probability mass functions of the form 



, X„, G {0, 1}^, the Ising model consists of 



(8) 



"((-''^iij ■ ■ • I ^ip) — {^1 



3)) oc exp I ^ Cj 



2 5Z ^jk^j^k 



where Q G MP is any vector, and for identifiability we constrain e MP^'p to be a symmetric matrix with zero 
diagonal. This model originated in physics to model states of particles, where informally we have Qjk > 
if particles j and k prefer to be in the same state, and Qjk < if particles j a nd k prefer to be in different 



states. For background and applications, compare e.g. Kindermann and Snell ( 1980 ) 



In the Ising model, the conditional distribution of Xij 
model — from (Isl), we obtain 



given {Xik '■ k j} comes from the logistic 



(Xij = Xj\{Xik = Xk ■ k^ j}) cx exp | (^Q + ^ QjkXk^Xj^ 



and therefore the log-odds are 
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To recover the true graph G* that describes the dependencies among the variables (or equivalently, the 
sparsity pattern in the true matrix Q*), we can thus use neighborhood selection with the logistic Lasso, 
which finds 

= argmiuft. | - ^ logP(X,,|{X,fc : k ^ j},/3,) + p||/3,||i| 

= argminft.| - X.^ • ipjo + ^X,kPjk \ + log M + exp |/3jo + ^^^^fc/^jfe} ) + pWjW 
The resulting graph estimate has an edge between nodes j and k based on the values of /3^j. and /?^^ , 



using either an AND or an OR rule; compare also Hofling and Tibshirani ( 2009 1 



Tuning the parameter p can be done using the EBIC for logistic regression. Our results for consistency 
of the EBIC for logistic regression then imply consistency guarantees for neighborhood selection with EBIC 
tuning. We assume that the following conditions hold (for constants q and c): 

(CI) The growth of p is subexponential, that is, log(p) — o(n), with n :— lim sup log„ (p) e [0, oo]. 

(C2) The true graph G* has degree bounded by q, that is, each node j has a neighborhood of cardinality 
\{k:{j,k)eG*}\<q. 

(C3) The true parameters are bounded with maxj|^*| < c and maxj < c. 

(C4) The signal is bounded away from zero such that 



log(np) 



min 19*1,. 



The following theorem gives a precise statement of the consistency properties of the EBIC for edge selection 
in the Ising model. 

Theorem 4. Assume that conditions (C1)-(C4) hold. Let Xi,, . . . , X„, e {0,1}^ be i.i.d. draws from an 
Ising model with parameters G W and O* G M''^'', where Q* is symmetric with zero diagonals. Let G* 
be the graph with edges indicating the nonzero entries of Q* , and for each node j, let S* denote its true 
neighborhood, that is, Sj — {k ^ j : 8*^, ^ 0}. Choose three scalars a,(3,j to satisfy 

\ a e (O, \) and p > Q, if k = 0. 
Then, for sufficiently large n, the event that the inequalities 

BIC^(5;) < min {BIC^(5,) : S, ^ j, S, ^ S* , \S,\ < q} ~ log(p) . (^7 _ (^1 - i- + ^3 + ^ 

hold simultaneously for all j has probability at least 1 — n^"p^^^^^-'. In particular, the EBIC is consistent 
for neighborhood selection (simultaneously for all nodes) in the Ising model, whenever 7 > 2 — 

5.4. Experiment for the Ising model. We compared the BIC, the EBIC with 7 = 0.25 and 7 = 0.5, 



10- fold cross-validation, and stability selection as in iMeinshausen and Biihlmann (2010) on the task of edge 



selection under an Ising model for precipitation data from weather stations across four states in the midwest 
region of the U.S.: Illinois, Indiana, Iowa, and Missouri. Performance is measured relative to the true 
geographical layout of the weather stations, which is "unknown" to the procedures we compare. 

5.4.1. Da ta and methods for model selection. We used data from the United States Historical Climatology 



Network (Menne et al. 2011| The data consists of weather-related variables that were recorded on a daily 



basis. We specifically gathered the precipitation data, which gives the total amount of precipitation for each 
day. Trying to limit the effects of temporal dependencies between successive observations, we took data from 
the 1st and 16th of each month. These are then treated as independent. We removed weather stations where 
data availability was low and discarded observations with missing values for any of the remaining weather 
stations. A total of 278 days and 89 stations remained in the final data set. Next, we hypothesized a "true" 
graph by computing the Delaunay triangulation of these 89 weather stations, based on their geographic 



^Available at http://cdiac.ornl.gov/ftp/ushcii_daily/ 
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Figure 3. Delaunay triangulation for 89 weather stations in Illinois, Indiana, Iowa, and Missouri. 
Table 3. Positive selection rate and false discovery rate in the weather data experiment. 
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BICo.o 


50.99 


47.13 


36.36 


30.83 


BICo.25 


45.45 


39.47 


30.04 


28.97 


BICo.5 


42.29 


33.95 


23.32 


26.25 


Cross-validation 


69.17 


76.42 


59.29 


65.83 


Stability selection 


24.51 


26.19 


13.44 


26.09 



locations, using the delaunay command in MATLAB (20101. Figure |3] shows a map with the resulting 
undirected graph. 

For each weather station j, we define binary variables Xij taking values 1 or depending on whether or 
not there was a positive amount of rainfall at weather station j on day i. For each one of the stations j, we 
then applied each of the five methods to perform a sparse logistic regression that has response vector X,j 
and covariates {X.f, : k ^ j}. Our method for selecting a neighborhood for weather station j, for each of 



the five methods, is identical to our methods for the regression experiment on email data (see Section 4.2.1 1. 
Finally, we combined each method with the OR rule and with the AND rule to produce a sparse graph, for a 
total of ten methods. 

5.4.2. Results. To evaluate the methods, we first treat the graph obtained via the Delaunay triangulation 
as the "true" underlying graphical model. Table [3] shows the results for each method, stated in terms of 
positive selection rate (PSR) and false discovery rate (FDR), relative to the "true" Delaunay triangulation 
graph. These results are also displayed in Figure |4j while the graphs in Figure [5] show the recovered graphs 
for each of the methods, combined with the and or OR rules. 

We see that cross-validation leads to a PSR that is somewhat higher than that of the other methods, 
under either an and or an OR rule. However, this comes at a drastically higher FDR. For the EBIC, as 
we increase 7, we reduce the FDR at a cost of a lower PSR, as expected. Stability selection appears to 
be a more conservative method than BIC-y for 7 — 0.0,0.25,0.5, with lower FDR and lower PSR, and was 



substantially more computationally expensive. While not shown, setting 7 
similar results to stability selection, in this experiment. 



1.0 with the EBIC yielded very 



14 



RINA FOYGEL AND MATHIAS DRTON 

Results for neighborhood selection 



CD 



• BICo (OR rule) 

O BICo 25 (OR rule) 

O BICo.5 (OR rule) 

O cross-validation (OR rule) 

9 stability selection (OR rule) 

▲ BICo (AND rule) 

A BICo 5 (AND rule) 

A BICi (AND rule) 

A cross-validation (AND rule) 

A stability selection (AND rule) 



A 



O 



O 



T" 



T" 



T" 



T" 



I 

100% 



0% 



20% 40% 60% 80% 

False discovery rate (FDR) 



Figure 4. Performance of each method, under the OR rule and the AND rule, where the 
true graph is defined via the Delaunay triangulation. 



The edges of the Delaunay triangulation likely capture the strongest dependencies, but it is reasonable to 
expect additional dependencies that are not captured by the edges in the triangulation. One way to compare 
the methods without referring to the Delaunay triangulation is to use the geographic distance between each 
pair of weather stations. For each method, we use Gaussian smoothing (scale: standard deviation = 10 miles) 
to estimate, as a function of d, the probability that the method will infer an edge between two nodes that 
are d miles apart. The resulting functions are plotted in Figure |6j where we also show the same smoothed 
function calculation for the graph defined by the Delaunay triangulation. 

We observe that the smoothed function for the cross-validation methods (under either the OR or the AND 
rule) does not decay to zero as distance increases. That is, in this experiment, the cross-validation methods 
tended to select some positive proportion of edges between nodes that are arbitrarily far apart, which is 
undesirable. To a lesser extent, the same problem occurs for the (original) BIG combined with the OR rule. 
The other methods, in contrast, yield functions that do decay to zero as distance increases. We see also that 
for two nearby weather stations, the extended BIG with 7 — 0.25 or 7 = 0.5 combined with the OR rule, are 
both significantly more likely to select an edge than the remaining methods, which are more conservative. 
Overall, the performance of the extended BIG compares favorably to the other methods, with a moderately 
good rate of edge selection for nearby weather stations, and with probability of edge selection decaying to 
zero when the distance between a pair of weather stations is large. 



6. Proof sketches for theorems 

To prove our Theorems, we use Taylor series to approximate log-likelihood functions, and Laplace approx- 
imations to approximate integrated likelihoods. In Section [6. 1| we introduce notation and state two technical 
lemmas bounding various quantities relating to the log-likelihood function. In Sections |6.2[ |6.3[ and |6.4| we 
outline the proofs of Theorems [T] [2] and |4j respectively. Full proofs are in the Appendix. 
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Figure 5. Graphs recovered under each method. (Black edges indicate true positives, red 
edges indicate false positives, and light gray edges indicate false negatives, i.e. true edges 
that were not recovered by the method, where the true graph is defined via the Delaunay 
triangulation.) 




Figure 6. Smoothed probability of selecting edges as a function of distance, for each 
method under the OR rule and the and rule. 



6.1. Preliminaries. Let S[„] (0) = V log (0) € W be the gradient of the log likelihood (or score) function, 
and let _ff[„](0) — — Vs[„]((/)) € M.p^p be the Hessian. Write sj(0) and Hj{(t)) to denote the sub-vector and 
sub-matrix, respectively, indexed by j G J. 
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The following lemma gives bounds that will be important in the proofs of the Theorems. 



Lemma 1. Fix any a,/3 > 0. Assume (B1)-(B5) hold, and that either (Al) or (A2) holds. For suffi- 
ciently large n, with probability at least 1 — n~"p~^ under (Al), or with probability at least 1 — n^°'p^^ — 
AK^^^n 2 under (A2), the following statements are all true. The symbols Ci, C2, A|, t, R, Ai, A2, and 
A3 appearing in the statements represent constants that do not depend on n, p, or on the data, but generally 
are functions of other constants appearing in our assumptions. 

(i) The gradient of the likelihood is bounded at the true parameter vector (f>* : 



(Hj{(j)*)-'/^-^ sjict)*) ^ < y^2(l + e„) |J\J*|log (?iV+'^) for all J D J* with \ J\ < 2q , 



_^ where er, ^C,^'-^^^i^^+C,^=o{l). 
(ii) Likelihood is upper-bounded by a quadratic function: 



log 



Ll 



L, 



< 



ijjh min{l,||^j||2}- 



log {n"p 



for all \J\ < 2q, tpj G 



(iii) For all sparse models, the MLE lies inside a compact set: 

||0,/||2 < R for all \ J\ < 2q . 

(iv) The eigenvalues of the Hessian are bounded from above and below, and local changes in the Hessian 
are bounded from above, on the relevant compact set: 

For all \J\ < 2q, ||0j||2 < R+l, Xilj ^ ^Hj{c^j) ^ A2I/ , 

and for all Uj^, Wjh < ^ + 1, ^ {Hji<f>j) - Hj{c^'j)) <Uj- cp'j^X^lj . 

We now state a second Lemma, which relates specifically to lower-bounding the eigenvalues of the Hessian, 
and may be of independent interest, as it holds under much weaker assumptions than those used in our other 
results. 



\X 



Lemma 2. Fix J with \ J\ — 2q, and radius R > 0. Assume Amin (E [Xj^jX^j]) > ai > and sup -g J E 
m. If n is sufficiently large, then with probability at least 1 — e^^^^^'^^^'^ 1 ) for all cf) — cf) j with 

m2<r. 



H.j{(t)) > nlj ■ ^ inf {h"{e) : \e\ < 20q^ry/7K \80q^ma^^] } . 
6.2. Proof outline for Theorem [l] The key bound in the proof is showing that 



L[n]i(t>j)fj{(l}j) d(j)J = Lyn-\{(t>j)fj{(t)j) 



H 



-1/2 



(27r) 



'I/2 



for all J with \J\ < q. To this end, we calculate the marginal likelihood in each model J by splitting the 
integration domain into three regions — a small neighborhood of the MLE called Afi, a larger region Af2\J^i 
obtained from a larger neighborhood Af2, and the remainder of the space, given by M'^\A/'2. 
Fix any model J with \J\ < q, and let (j>j be the MLE. Define the neighborhoods 

Hj(^.jY''{cj)j - < v/41og(np)} , 

2 J 

Then the marginal likelihood is the sum of the three integrals 



AA2 



(Intl) 
(Int2) 
(Int3) 



L[n\{4>j)fj{(l>j) d(t>J 



L\nMj)fj{<l>j) d(j)j , 



(t>jm-'\N'2 



L\nMj)fj{(t>j) d(f>j . 



< 
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The bulk of the proof now consists of computing approximations to each of the three terms separately. 

It is at first surprising that we split into three regions, rather than two, as is done in other work with 
'fixed p.' The intuition for our split is as follows: 

(Intl ): In the smallest region. Mi, a quadratic approximation to the log- likelihood function is extremely 
accurate, and we can use it to prove the accuracy of the Laplace approximation for this part of the 
integral. 

(Int2): In the intermediate region A/'2\A/i, while the quadratic approximation to the log- likelihood 
function is no longer very accurate, we can still obtain a quadratic upper bound. Hence, the integrand 
behaves as e"'^"'^-'"'^-'"^ for an appropriate constant c, meaning that we can use tail bounds for the 
distribution to prove that the contribution of this region is negligible. 

(IntS): Outside oi M2, the quadratic approximation may no longer be accurate enough to use the same 
reasoning as for the intermediate region. However, due to convexity of log-likelihood function, the 
integrand is at most roughly e^^ \\<t>j-<t>j\\-2 fQj- appropriate constant c'. (Note that the quantity in 
the exponent is no longer squared.) Therefore, we can use tail bounds for an exponential distribution, 
to show that the contribution from this third region is also negligible. 

Exponential tail bounds are much weaker than those for the distribution, which explains why we 
separate the area outside of A/i into two regions — first, the intermediate region M2\Mi which contains points 
that are relatively close to the MLE, for which we can apply strong tail bounds, and second, a region IR'^\A/'2 
where we can only apply weaker tail bounds, but where all points are rather far from the MLE. 

6.3. Proof outline for Theorem [2[ Lemma [T] deals with issues arising from random covariates. Given 
the results of Lemma [l] our proof of this theorem follows the same reasoning as that of Chen and Chen 



(2011). Only slight modifications are needed; we give the details in the appendix for completeness. The 
proof consists of two parts that separate the treatment of incorrect and of true models: 

(a) An incorrect sparse model is a model J with J ^ J* . In such a model, the distance between (pj and 
(j)* will be large enough such that the likelihood function of model J achieves only low values. The 
model will thus not be chosen over the true model J* . Specifically, the lower-bound on the signal in 
assumption (B5) ensures that the change in the EBIC when comparing model J to model J* , is at 
least on the order of log(np). 

(b) A true model is a model J with J D J*. In an overly-large true model, the achievable increase 
in likelihood due to the extra degrees of freedom will not be large enough to compensate for the 
increased model size, and so again J will not be chosen over the smallest true model J* . Specifically, 
the increase in the achievable log-likelihood will be bounded on the order of | log(np), which 
will be outweighed by the additional penalty on the larger model J. 

6.4. Proof outline for Theorem |4} Considering each of the p regressions separately, we obtain consistency 
of the EBIC with probability at least 1 — n^^p^^ via Theorem I2I as long as all the conditions (B1)-(B5) 
hold. Using our assumptions for this current theorem, all these conditions hold by assumption, except for the 
eigenvalue bounds on E [X^ jX'[j] for all | J| < 2q. We derive these bounds in the appendix, using properties 
of the logistic model combined with the conditions assumed to be true. 

7. Conclusion 

As discussed in detail in the introduction in Section [T] the results in this paper make a formal connection 
between Bayesian model determination and model search using recently-proposed extended Bayesian infor- 
mation criteria (EBIC). Our results pertain to sparse high-dimensional generalized linear models based on 
a one-dimensional univariate exponential family and with canonical link. Evidently, a number of generaliza- 
tions would be of interest for future work. 

Remaining in the univarate exponential family framework, regression under non-canonical link could be 
considered in a fashion similar to what we have done here. Very recently, a treatment of this problem in the 



vein of Chen and Chen (2011) has been undertaken by Luo and Chen (2011b). Another extension would 
be to allow for exponential families with more than one parameter in regression models. This would in 
particular recover results for linear regression with unknown variance as a special case. 

A different paradigm would be the graphical model setting. As reviewed in Section [5. H there is a version 
of the EBIC that enjoys consistency properties in the Gaussian case. However, it remains an open problem 
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to establish a formal connection to fully Bayesian graph selection procedures. Moreover, we hope that the 
available consistency results can be strengthened to avoid, in particular, decomposability assumptions for 
the concerned graphs. 
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Appendix A. Proof of Theorem [T] 

Theorem [l| Assume that conditions (B1)-(B5) hold, and that either assumption (Al) or (A2) holds. 
Moreover, assume the following mild conditions on the family of priors {fj : J d [p],\J\ < q), which require 
the existence of constants < Fi, F2, < oo such that, uniformly for all \J\ < q, we have 

(i) an upper bound on the priors: 

sup0.,/j((?!)j) < -Fi < 00, 

(ii) a lower bound on the priors over a compact set: 

inf ||^^ll^<^+i/j(0j) > F2 > 0, 

where R is a function of the constants in assumptions (Al) or (A2) and (B1)-(B5), defined in the 
proofs, 

(iii) a Lipschitz property on the same compact set: 

sup U.j\\2<R+l l|V/.7(0.7)|l2 < -P3 < 00. 

-I 1/2 3/2 

Then there is a constant C , no larger than 4F3F2 Xi + 2qX3X^ + 2, such that, for sufficiently large n, 
the event that 

(9) 



Bayes(J) = P{J) ■ L^^}{lj)fSj) ■ H^j) ^'^ {2nf^/^ ■(i±Cy 



uniformly for all models J with \ J\ < q occurs with probability at least 1 — (np) ^ under (Al), and with 
probability at least 1 — (np)^^ — AK^^^n 2 under (A2). In particular, for the (unnormalized) prior 
P{J) = L^jV^' ■ 1 {\J\ < q}, it holds that 



(10) 



|log (Bayes^(J)) - (-iBIC^(J))| < Ci 



where Ci is a constant no larger than | log(27r) + ^q\og{2q) + glogmaxjA]^ ^, A2} + log maxlfi , F2 ^} + 1. 

Proof. First, wc show that the approximation ^ to the Bayesian marginal likelihood will imply the bound ( 10 1. 
We have 



log (Bayes^(J)) - log P{J) ■ L^n]{^j)fj{<j>j) ■ Hj{cj,j) (2^)''^'/^ ■ \ l±C 



log(np) 



= -7 log (^1^1 j + logi[„](^j) + log/j(^j) - \ log Hj(^j) + ^ log(2^) + log j^l ± C 
We now approximate some of the above terms. First, by definition of we have 



log(np) 



kllogb) > log 



> log 



|J|I^I 



> log 



p 

2\J\ 



> \J\\og{p) - qlogi2q) 



Next, since Ainlj ^ Hj{(j)j) ^ A2nlj, we have 

I J| log(n) + I J| log(Ai) = log lAmljl < log Hji^j) < log |A2nl7| = | J| log(n) + \J\ log(A2) . 



Finally, F2 < log fj{4>j) < Fi by assumption, and for sufficiently large n, Cy < ^. Combining all of 
the above, we get 

log (Bayes^(J)) = logL[„](^j) - ^ log(n) - 7I J| log(p) ± Ci - -iBIC^(J) ± Ci , 
where we define 

Ci := |log(27r) +7glog(2g) + glogmax{A]"\A2} + logmax{Fi,F2"^} + 1 . 
We next prove the approximation We need to show that 



*J6 
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for all J with | J| < q. 

To this end, for each model J, we split the integration domain into three regions — a small neighborhood of 
the MLE denoted by Afi , a larger region JV2\JVi obtained by taking a larger neighborhood Af2 and subtracting 
the first region, and the remainder of the space, given by M'^\A/'2. 

Fix any model J with \J\ < q, and let be the MLE. Define the neighborhoods 



M := {0 : ||i/j(0j)V^(0j - < v/41og(np)} , 



M2 := {0 

(We assume n is large so that Mi C A/'2.) We write 
/ L^n\{(t>j)fj{<l^j)d(j)j = 



'J - 9,7. 



< ■\/ Ai?i 



} 



[n]i<f>j)fj{(l>j) d(t>j 



(Intl) (Int2) (Int3) 

We now approximate to each of the three terms separately. First, by Lemma [ijiv), for a point e J\^2^ 
\\Hj((l)jY/'^{(j)j — 0j)|l2 < \/Ai7i implies — 4>j\\2 < 1- Therefore, integrals (Intl) and (Int2) are both 
computed in a neighborhood of radius 1 around 4>j. The main idea for the computations below is that the 
contributions of (Int2) and (Int3) are negligible, while the value of (Intl) can be very closely approximated 
by using the second-order Taylor series expansion to the likelihood. 

Approximating (Intl). In a very small neighborhood around 0,/, the quadratic approximation 

i i 

^Y^^Sj) - \{^.J - 0j)^i/j(0j)(0j - 0j) 

i 

is very accurate, and we can therefore use a Laplace approximation to the integral in this small neighborhood, 
to obtain 



(Intl) 



^^exp|^^,(0j)- i| 



0j)^i7j(0j)((/.j-(/.j) /j(0j)#j 



{2TTr'^fS.j)\HSj) 



-1/2 



exp 



To make this approximation rigorous, we begin by giving precise bounds on the approximation to the 
likelihood in a neighborhood of (pj. By Lemma 111 iv), for any 0j with — 4>j\\2 < 1, for some t G [0, 1], 
we have 

E {^Mj) - ^.(^J)) = ^Jsj(?j) ~\i<t>J~ hfHji^j + t[c^j - ^j))i(t>j - 0j) 

i 

= -^(0,7 - ljfHj($j + t{cpj - 0j))(0j - 0j) 

= -\{<t>J - ^jfHj(^j){cj)j - M ± ^1107 - hWl \\Hjih + tiPj - M) ~ Hj{$j) 
(11) = -^(07 - ^jfHj{^j){<j>j - 0j) ± ^Uj - ^jWlnXs . 

(Here |lAf|lsp denotes the spectral norm of the matrix M.) Recall that for all (f)j G Ni C A/2, we have 
\\4>J ~ 4'j\\2 < 1- Applying the approximation for all (f>j e A/i, we claim that 



sp 



(Intl) = (27r)'-"/V7(?7) 



H 



-1/2 



3/2 



1 



log(np) 



We will now prove this bound. 
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Applying Lemma [ijiii) , |l(/!'j||2 < 110./ II 2 + \\4>J ^ <pj\\2 < R+ 1- By om^ assumptions on /j on the ball of 
radius i? + 1 at zero, 



log /. 



JAM 



Next, since by Lemma[l|iv) we know that Hj{4>j) > Ainlj, we apply the definition of Mi to obtain 



Il0.-?.II^^A3< J^^^-||0,-<,,||2 



Applying the three above bounds, we obtain an upper bound on (Intl): 

(Intl) = cxp d<Pj 
= exp|^^,(<^j)|^ expj-^ 

< exp 1^^.(0,7) I exp - hfHSj){<l^j - M (^1 - ^j^^^^^^ | fAM d^j 
<exp|^£,(0j)+log/j(0j) 

/. 



1 4 log(rip) 



X / exp - M-H.M,Mj -mU- f-"^^ 



Changing variables to ^ = 1 1 



4\l log(wij) 



1/2 



H.j{(j).jy^^{(j)j — 4>,j), the upper bound becomes 



exp { Y: eSj) + log JAM + J ^-^^^F,F-' 



I 



H 



V2/ _ /4AilogM 



-\J\/2 



Afr 



/ ; 4,2, , , x-V2 , exp<{ -^llella 1^ rfC 



< exp { + log + W'^FaK 



./1/2 



Ain 



3-^2 



x^^^^(2^)-''^'/^exp|-l||e||^| 



F^F^' > ■ 



H 



1 - 



1 4Ai log(np) 



J\/2 



Afr 



1/2 
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We similarly obtain a lower bound: 
(Intl) = exp |^^i(</>7)| fAM dcl>j 

= exp|^£i(0j)| j/^ exp|-i( 

> exp exp - h fHji^jXc^j - h) (l + \J^^^^^^ } /• 

>exp|5:4(^.) + log/.(?.)- 
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d(j)j 



Ain 



Changing variables to ^ = ^1 + ^/-^^^^^^) Hj{<^jy/^{<pj — <j)j), the lower bound becomes 



Ain 



F^F^^ \ ■ 



V^/ ^ /4AilogM 



1/2 



Afn 



(27r)' 



^1/2 



X / . /ii^i^x-V^ , (27r)-'^'/^exp^-^||e||2^ #^ 



1, 



41og(np) 



>exp<j^£i(</>j) + log/j(</.j)- y 
xPjxfji <21og(np)} 



> exp { Y.^Sj)+^oMh) - \r-¥?^FsF,-^ ^ • 



1/2 



Ain 



1/2 



/4A§log(np)\ 
A?n J 



/ 4A§ log(np) 
Afn 



-IJ'l/2 



-IJI/2 



(27r)'^'/^ 



for sufficiently large n. 

Combining the upper and lower bounds, we therefore have 



(Intl) = exp <j ^^i(^j) + log /j(^j) 



Ulog(np) F3 
Ain F2 



V^/ / 4Ailog(np) V 



^1/2 



(277)'^'/= -(I-C), 



for some c satisfying < c < e iog("p)/2. Since log(np) = o(n), we can thus write 



(Intl) = (27r)'''/^/j(^j) HMj) exp \ ^iSj) 



1/2 



(-( 



4F,F,-A,-/" + 2,A3Ar'= + l)-\/^ 
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Bounding (Int2). For \\(l)j~4>j\\2 < 1, we can apply Lemma[l|iii) and (iv) to see that Hj{(j)j) >: Xinlj ^ 
\iX2^Hj{(f)j). Therefore, by the Taylor series approximation, for H'/',/ — 0,7||2 ^ li since sj{<j)j) = 0, we have 



(12) 

and so 
(Int2) 



Ai A, 



bj-cj>jfHj{cj>j){cl>j-(j>j) 



expJ V^,(<^j)l/j(0j)d0 

Jf,7(0,/)V2(0J-0j)|l,<V^ { , J 

, _ exp<^V£,' 

V41og(np)<||ff,7(0j)V2(0,„0,)|j^ y ^ 

Fiexp|^£,(?,,)| \HSj)\''\\2X-^r'^ 



Wi log(np)<|| 



elli>21og(np) 



1 



exp --iieii^ 



i^iexp|^£,(^,,)| \hSj)\ ■ (27r)'^"/^p{xfj| > 21ogM} 



1/2 



<Fiexp<^^ Hj{cj,j) (A2A^^)'■"/^•(27r)'^"/^•e-'°s("P)/^ 



by the chi-square tail bounds derived by Cai 



(2002) 



Bounding (Int3). For all 0,/ such that 



Hj{<t>jy'^{ct>j ' 4>j) 



= \/Ai7i, by ( 12 1, we know that 



AiAj"^ ■ "v/Ain 



and so by convexity of likelihood, for all such that 



A1A2 ^ • y/Xin 



Therefore, 



(IntS) = / exp I V e,iM \ fjiM d(t>j 

■J\\Hj{^.,)yH4>j-?j)L>V^ I , I 



<i^iexpJ V£,(^j)l / 



exp 



A1A2 ^ • \/Xin 



Changing variables to ^ = H ,;{(/>, jY^^ {(/),; — the integral is equal to 

-V2 r r AiA2"^ • ^A^^ 



= Fiexp|^^,(^j)| 
<i^iexp|^€,(0j)| 



exp 



C||2>\Airi 



Ilell2 '^C 



'^'•exp{-n.A?(4gA2)-i} , 



where the last inequality is proved as follows: 
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/ exp <^ 

JU\\2>V^ I 



AiAo ^ • \/\in 



2>VXin 



^i■M\\2>^/^ 



2 

exp 



Iieii2 



AiA, ^ • \/Xin 



2 

<2lV expj-^ 
-'?eKi,||«ll=o>V^in|J|-i ( 2A2Vkl 



•kr'/^(ei + ---+eij|)^ 



= 2^''\ 

< 2l'^l • I J| 

< P \ Exp 



2A2v^ 



niax{Zi, . . . , Z|j|} > y/Xin\J\ i : Zi, . . . , Z|j| ~ Exp 



Ai-5^ 
2A2VIJT 



Ai-^V^ 



-1^1 



2A2v/|^l 



> 



VAin|J|- 



2A2v/|J| 



exp ■ 



VAin|J|- 



1 




A}^ 


/n 


2A2V 





Combining the bounds. Applying our approximation of (Intl) and bounds on (Int2) and (IntS), we 
have 



/ exp 1^^,(0,;) I 



fj{(j)j) d4>j = (Intl) + (hit2) + (Int3) 



{2TTr''fSj) \hSj) 



-1/2 



exp 



1 ± ( 4F3F2-IA1 + 2gA3Ai + 1 



'/2 



log(np) 
n 



1-1/ 



±i^iexp<^^^,(?j)^ '(A2Ar')''"/^-(27r)'"/=.e-'°s("P)/2 



1/2 



±i^iexp|^^,(0j)| •exp{-n.A?(4gA2)-i} 
= (2^)''''/Vj(0j) |i/j(0j)|"'^%xp|^£, ^ 

• (1 ± (4^3F2-A-/= + 2,A3A7^/= + 1^ y ^ .... 2 ^ ^- . ^„.,2(,,,,,-. 
= (2^)''^'/Vj(0j) |iJj(0,7)p'^'exp • ± {AF:iF^^\-''' + 2gA3A7'/^ + 2 

for sufficiently large n. 



log(np) 



□ 



Appendix B. Proof of Theorem [2] 
Incorrect models. Fix any J ^ J* with |J| < q. We first consider the loss in likelihood resulting 

from excluding one (or more) of the true covariates. Recall that — o(min{|(/)*| : j G J*}) by 

assumption — we use this in several inequalities below, marked with a *. 
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We apply Lemma llTii) to the set J' = J U J* with tpj' — (pj — and obtain 



log i [„]('/' J ) - logL[„](0*) = logi[„](0* + (0,7 - (j)*)) - logL[„](0*) 



< 


K 






n 








2 








\l 






< 




mm 








< 










mm 




4~ 





min |l 



log 



min \ 1, min 



2 

1 . 

- mm 

2 jeJ' 



Then 



BIC^(J) - BIC^(J*) = -21ogL[„](0j) + 21ogL[„]((^jO + (| J| - | J*|) log(n) + 27 (| J| - |J*|)log(p) 
> -21ogL[„](0j) + 21ogL[„](0*) + (|J| - |J*|)log(n) +27(|J| - |J*|)log(p) 



2 je 



in — 2(7 log ^71 ^/^p'*'^ 



> • mm 

■ I ,*|2 \*in . I ,,,2 



mm 0, 



> • mm 1^., I 

> — — • mm 0,- 



> log(p) 



for sufficiently large n. 

True models. Fix J D J* with | J| < g. We first compute an upper bound on the increase in likelihood 
due to including additional (false) covariates. For sufficiently large n, we apply Lemma [ijii) and (iv) and 
obtain, for some t £ [0, 1], 



logL[„](^,/) - log i[„] ((/)*) 







b*) - 








<( 


(f>J - <l>*fsj{q 


/,*) _ 




rfHj{<t>*){h-c 




<( 




/,*) _ 








< 













< sup ( z sj 



|J\J-|log(n"p'+») , 



log(n) 
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where the last inequahty is obtained by applying Lemma [TJ^i) . Hence, 
BICt,(J) - BIC^(J*) 

= -2 log L[„] + 2 log L[„] ) + \J\r I log(n) + 2-f\J\r \ log(p) 
> -21ogL[„](0j) + 21ogL[„](0*) + |J\J*|log(n) + 27|J\J*|log(p) 



> -2 1 



= 2| J\ J* I log(p) . + ^ - (1 + o(l)) f 1 + + 

>iogb).(7 + ^-(i + /3 + |: 

for sufficiently large n. 

Appendix C. Proof of Theorem [4] 

Theorem |4| Assume that conditions (C1)-(C4) hold. Let Xi,, . . . , Xn, G {0,1}^ be i.i.d. draws from an 
Ising model with parameters C^* £ MP and Q* £ M''^'', where Q* is symmetric with zero diagonals. Let G* 
he the graph with edges indicating the nonzero entries of Q* , and for each node j, let S* denote its true 
neighborhood, that is, Sj = {k ^ j : 0*^. ^ 0}. Choose three scalars a,/?, 7 to satisfy 

r 7>i-2k+/?+^. z/^>o, 

\ a e (0, \) and P>Q, if k = 0. 
Then, for sufficiently large n, the event that the inequalities 

BIC^(5;) < min {BIC^(5,) : S, ^ j, S, + S* , \S,\ < q} - log(p) . (j - (l - + ^ + ^ 

hold simultaneously for all j has probability at least 1 — In particular, the EBIC is consistent 

for neighborhood selection (simultaneously for all nodes) in the Ising model, whenever 7 > 2 — 

Proof. Considering each of the p regressions separately, we obtain consistency of the extended BIG with 
probability at least 1 — n~°'p~^ via Theorem [2] as long as all the conditions (B1)-(B5) hold. Using our 
assumptions for this current theorem, all these conditions hold by assumption, except for the eigenvalue 
bounds on E [X^jX^j] for all \ J\ < 2q, which we now derive from properties of the logistic model combined 
with the conditions assumed to be true. 

We need to find constants 01,02 > such that, for all |J| < 2q, ailj ^ E [Xj^jXjj] ^ a2lj. We now 

show that setting ai = 2^ a3(g+i)'^ ^2 = 2(7 will satisfy this bound. 

Fix any unit vector u with support on | J| < 2q. We will show that ai < E [(X^u)^] < 02- Since 
{Xii, . . . ,Xip) takes values in {0, 1}^, we have E [(Xj^w)^] < ||u||^ < 2g||u||2 — 2q — 02- Next, we find a 
lower bound. Choose jo to maximize u'j^; then Uj^ > Let Jq = ^\{jo}- We have 

E[{X^juf] -E[E [iXfju)^\X,j^]] 

= E [{Xlj^u,„f + 2{X^j^uj„)u,„E [X.jXuo] + <E [XljXuo]] 



E 



{{X,j,;E[X,,,\X^jjfu,,)' + ul (e [XljX,,,,] - E [X,,,\X,j,\ 



= E [{{X,Jo■MXl■J^X^J,^fu,,)^ + <Var [X^jX.j,)] 



>iij^E[Var(XyjXi,;J] 
> lE[Var(Xi,jXi,;J] 
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Now take any fixed value of a^[p]\{jo}- Using tlie logistic model, 



We also have 



Var (XijJXijp]\{j„} = a;[p]\{j„}) 

(l + exp|c,o+Efc^,o^fc0,V.} 

Oo + J2k:{3o,k)€G' ^kQ*^k < lOol + 9supj-,fc < a3(g + 1), and so 
Var {Xijg l^i,[p]\{io} = ^lp]\{jo}) > 



|t|<a3(9+l) (1 + e*)2 (1 +£''3(9+1)) " 

Since this is true for any X[p]\[jgj, we therefore have Yai{Xijg\Xijg) > f^-^'^^a^ti+i)^ everywhere, and 

1 1 013(9+1) 



so 



Appendix D. Proof of Lemma [T] 

Lemma [T| Fix any a, 13 > 0. Assume (B1)-(B5) hold, and that either (Al) or (A2) holds. For suffi- 
ciently large n, with probability at least 1 — n~"p~^ under (Al), or with probability at least 1 — n^°'p^^ — 
AK^^^n 2 under (A2), the following statements are all true. The symbols Ci, C2, X\, t, R, Ai, A2, and 
A3 appearing in the statements represent constants that do not depend on n, p, or on the data, but generally 
are functions of other constants appearing in our assumptions. 

(i) The gradient of the likelihood is bounded at the true parameter vector cj)* : 
(13) 5.7(0*) ^ < (1 + e„) I J\ J* I log (n^pi+zS) for all J ^ J* with \ J\ < 2q , 



(14) 



where e„ ^C,^^^^^^ + C,^^^ oil). 
(ii) Likelihood is upper-bounded by a quadratic function: 



log 



2 min{l, II-0JII2} - T 



log (n"pi+'3) 



for all \J\ < 2q, ipj € 



(iii) For all sparse models, the MLE lies inside a compact set: 

(15) \\(^j\\2< R for all\J\ <2q . 

(iv) The eigenvalues of the Hessian are bounded from above and below, and local changes in the Hessian 
are bounded from above, on the relevant compact set: 

(16) For all \J\ < 2q, ||0j||2 <R+l, Xilj ^ ^Hj{c^j) ^ A2I,/ , 

(17) and for all U.jh, U'jh <R+l,T^ (HjiM ~ Hj[c^'j)) ^ - <^:;||2A3lj . 
We present the proofs of the various claims separately. 

D.l. Bound on the score at (j)*: proving (13). The following lemma is proved later, in Section [E| 

Lemma 3. Fix any radius r > 0. There exist finite positive constants c, /3i — Pi{r), = P2{t), and 
h = Pzir) such that 

Pilj ^ T,Hj{(t,j) ^ P2IJ for all \J\ < 2q and all \\(j>j\\2 < r, 

and i (Hjicf^j) - Hj{<j>'j)) ^ ||0j ~ <i>'j\\2 ■ fi^lj for all \J\ < 2q and all ^j^, U'jh < r, 

with probability at least 1 — p^'^e"™ under (Al ) or with probability at least 1 — 2K^^^pn^'^/'^ — p^'^e"™ under 
(A2). 
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For large n, since log(p) — o{n) and q is constant, we have p^'^e ™ < e < |n Therefore, by 



Lemma [sj with probabihty at least 1 — |7i "p ^ under (Al) or with probability at least 1 — 
2K^+^pn-''/^ under (A2), 

(18) A*Ij ^ ^H.j{(f>.j) ^ X*Ij for all | J| < 2g and all ||0j||2 < as + 1, 

(19) and i (i/j(0j) - i?j(0j)) ^ Uj - ^'jh ■ A^Ij for all \J\ < 2q and ah W^jh, U'jh < as + 1, 



where := /3fc(a3 + 1) for k = 1,2,3. For the remainder of these proofs we assume that (181 and (191 are 
true. 

We now bound the magnitude of the score. (We adapt the proof from |Chen and Chen ( 2011 )) . By Lemma 
2 of Chen and Chen pOll), there is a constant Uq such that, for all J with \J\ < 2q, there exists a set of 
unit vectors Uj C K'' with \Uj\ < Uq, such that for all v G M"', ||i;||2 < v^l + emax^g;^^ u^v. 

Now fix any J D J* with \J\ < 2q, and any u G Uj. Below, we will show that 



< exp ■ 



By the definition of Uj, we then have 



(i/j(0*)-V^) sj(0*) > •v/2x/rT^ I J\J* I log (nV+^) 

I J. . „ i+,3N A /yrT^-2<7log (nV+'^) 
-^/r + ^UV |log(n p +0 1-^ {XmXD-^n 



||(iJj(0*)-V2) sj(0*)||^ > ^j2VT+~e \J\J* I log {n^p^+P) ■ ^TT^ 

< 5] pjw^ (iJj(0*)-v-^) sj(0*) > y^2VTT^|J\J1iog(nV+^)| 

new,; ^ 



Therefore, 



3J c[plJ2J*AJ\<2q,\\[Hj{<t>*)-'/') s,7(0*)||^ > ^^2(1 + e) \J\J*\\og{n^p^+f>) 

(i/j(0*)-v^) ^^('^*)|L - \/2\/rT^ I j\ J* I log (n^pi+zs) . ^r+^l 



2g-|J*| 

^ E E 

Af=l JC[p],J2JM 7X7*1 = ^ 
2q-\.J'\ 



< 



E 



exp <^ -VT+e\J\J*\\og (n>i+^) 1 



2g-|J*| 

< exp<( -Ar\/rT^log(nV"^'^) I 1 



Vl + e- 2(7 log {n°'p^+P) 



N=l 



'\/r+l- 2glog (nV+'^) 
(A*i)3(A5)-2r. 

-7Vlog(p)+log(C/o) 



< ^ exp <( -iV%/rTelog (nV) ( 1 



VTTi- 2(7log {n'^p^+P) 



We can simplify the expression above, as long as e is large enough to allow us to remove the vanishing terms 
inside the parentheses. In fact, for 



e = 3 



/ 4glog(nV+/^7 , ^ -11 ^ / log (nV+Z^) 
V (At)3(A5)-^n fogJt/o):=C,^ ^ 



C2 



1 



log(n) ' 
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we get 



P jaj C H, J 2 J*, I J| < 2(j, II sj{^*)\l > ^J2{l + e)\J\J*\\og{n-p^+0) 

oo 

< -iV^rTelog(nV) 1 



VTTi- 2qlog (nV+'^) 



oo 



< ^ exp {-AT log (nV) - log(6)} = 



1 



6(1 -n-"p-/^) - 3 
which completes the proof, except that it remains to be shown that 

,T 



> ,j2VT+~e\J\J*\ log (n^p^+P) 

< exp ^Vl + e \J\J I log (n p +0 (^1 - ^ (xm)^ j j ' 

The proof of this remaining inequality follows the techniques of Chen and Chen (2011 1; we include it here 
for completeness, since we require a slightly more detailed analysis of the probabilities involved in order to 
obtain consistency results for the graphical models setting, as in Theorem |4] 

Let u € M'^ be a unit vector. We now compute an upper bound on the quantity (_ff,7(0*)^^/^) sj(0*) 
that holds with high probability. Since sj {(/)*) = J^i-^ijO^i ~ ^1)5 '^e have 



Next, for convenience we write A :— \J 2\/\ + e | J\ J* | log (n"pi+^) and = A - ^/^) u. Since 

Var {sM*)) = H.M*) = XuX^h" (Xlcj,*), we have 

Y: {Xli^jf ■ h"{Xlc^*) ^A^Y. (^^v {HMr'^) uf ■ h"{Xl<^*) 

i i 



/2 



= A' ■ 



2 „,T 



And, 



We then have 



Wi^jg = A^ II (^Hjirr'/') u\\^ < A' ■ \\Hji^*r\p ■ Ml < A^Xlny^ 



E 



E 



1 |^E(^» - ■ (Hj{cj>*)-''^) u>A 
l\^{Y,~^i,)■XfJ^J>A^ 

exp|^(y,-AiO-^5^J-^'| 
= exp - E • ^iSV'jj • E exp |^ 
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= exp |-^' - ^ ■ ^fj^jj • n E [exP {^^ • ^ui'j}] 

= exp j-A^ - E ^« • ^^J^J^ • n exp { [b {Xl {<p* + i^j)) - b (X;:^*)] } 

= exp - E • ^i^^^l ■ ''''P ^ ^'^"^ + " } ' 

where the next-to-last step comes from the properties of exponential families. 
By the Taylor series approximation, for some t G [0, 1], 

i 

= 5: b' {xi4>*) ■ xii,,, + lb" ) • ixrM' + lixijr (b" {xr.{<p* + - b" {xir)) 

i 

= • Xr.^j + lb" (Xlr) ■ {XlM'^ + i^J (^Y. ^^J^^J (b" (XTM + - b" (^.^r ))) i^J 



2 2(A^)i-5nO-5 • 
Continuing from above, we obtain the desired inequality as follows: 



V[u^ (HM*)-y^)sj{c^*)>A] 
< exp |-^' - E • ^u^'J^ ■ exp |e [b {xl. {<P* + ^j)) - b {xj,^*)] | 



< exp 



2 2(A*)i-5nO-5 



exp I -^/TTe I J\ J* I log 1 



' yrTe • 2g log 

(AI)=^(A5)-2n 



D.2. Accuracy of MLE for true sparse models: proving (14). Assume that (18 1, (19) and (13) hold 



Fix J with J D J*, I J| < 2q, and fix any ipj with ||''/',7||2 < 1- Then 

logL[„](0* + V,/) -logL[„](0*) = ^JsjI.^*) - \iJ^,Hj{c^* +ti;j)i^j 



< Ujh-\\sj{r)\\2-Uj\ 



2 A^ri 



< ll^.7||2 • ^2{l + e)\J\J*\\og{n-p^+P) ■ ^X^- ■ ^ 
1 11^.112 f IIV..II2 - ^^^^f^ . Vl6(l + .),A;(At)- 

log(n"pi+/^)\ 



2 
A^n 



||^j||2 llV^/lb - T 
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where t = ^16(1 + e)q\2{Xl)^'^ . By convexity of the log-hkehhood, this means that for aU ■0,/ G M'^, 



logi[„]((/.* + ijj) - logi[„]((/.*) < -^||Vj||2 min{l, 11^112} - T 



log 



which proves (14). In particular, since logL[„]((^j) > log _L[„] (0*), applying the convexity of log-likelihood, 
we must have (for sufficiently large n) 



Uj-'I^*\\2 < 



log (n^p^+P) 



D.3. Compact set containing all sparse MLEs: proving (15), (16), and (17). Assume that (I8I, 
([19|, ([13]), and ([14| hold. Let 

i?:=l + a3+4(A*)-i (^^2(1 + e){l + a + I3)alq\* + ^ajX*^ . 

We now show that ||0,7||2 < R for all |J| < 2q. We will use the fact that, since the zero coefficient vector 
G W is contained in every model J, the coefficient vector 0j must yield higher likelihood than the vector 
0. 

First, we compute a lower bound for L[„](0): 



iogL[„](o) - iogi[„](r) = i-rfsj-^ir) - \{-rfHj.{tcf>*){-q^*) 



> -^r^Hj,{r)r\\Hj.{ci,*r'^'sj,{r)\l - \<i^*^Hj,m*w 

> ■ • + I A^l log (nV+^) - \al ■ n\; 

= -n [^J2{l + e){l + a + /3)a2^A*yi5^^!^^^^ + \alX*^ 

> (^^2{\ + e)(l + a + P)alq\l + ^a^A^^ , 

for sufficiently large n, since log (n"p^+'^) = o(n). 

Next, we consider L[„](0,/), and find that since logi[„](0,7) > logL[„](0) by definition, this results in a 
bound on ||0j||2. Fix any J with \ J\ < q. If \\^j - (/)*|i2 < 1, then \\$j\\2 < 1 + ||(/'*||2 < 1 + as < i?- Now 
consider the case that — (/)*||2 > 1- Applying (14) to the model J' := J U J* with ipji := — <})* , we 
obtain 



logL[„](0* + (^j - 0*)) - logL[„](0*) < -^ll^j - ri|2 ("minll, - ^ II2} - T^j 



log (n^pi+Z^) 



< 



2 
A^n 



ii'/'j-rii2(^ 
ii?,7-rii2 , 



1 - T 



log (n^pi+z^) 



for sufficiently large n, since log(j3) = o(n). Combining these results, we obtain 

-n (^^j2{l + €){\ + a + p)alqX1, + ^a^A^j < logL[„](0) - logL[„](0*) 



< logL[„]((}!).7) - log £[„]((/)* 
<-^||0,-ril2. 
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Therefore, 

||0,7||2 < II0II2 + 110.7 - < 03 + 4(A*)-1 



2(1 + e)(l + a + /3)a2gA* + -a^A* < R 



FinaUy, define \k — l3k{R+l) for fc = 1,2,3. As in Section D.l we apply Lemma [s] and see that with 



probability at least 1 — \n ^ 



2K 



K+l 



pn 



under (A2), 



2K^~^^pn '^^^ under (Al) or with probability at least 1 — °'p ^ 



Ailj ^ ^Hj{(j)j) < A2l,7 for ah |J| < 2q and all Wcft.jh <R+1, 

and i {Hj{<j)j) - Hj{(j)'j)) ^ - cj^'jh ■ X3I, for aU |J| < 2q and ah ||0j||2, U'jh <R+l. 

Appendix E. Proof of Lemma [3] 
We now prove the bounds on the Hessian. 

Lemma [3| Fix any radius r > 0. There exist finite positive constants c, /3i = [3i{r), — (32{r), and 
= Paif) such that 

^ ^Hj{(f>j) ^ (32lj for all \J\ < 2q and all < r, 

and i (ff/(0j) - Hji^'j)) 0'j||2 • klj for all \J\ < 2q and all ^j^, U'jh < r, 

with probability at least 1 — p^''e~™ under (Al ) or with probability at least 1 — 2K^^^pnr'^/'^ — p^'^e"™ under 
(A2). 



Proof Under (Al), E lATi^f < A^, while under (A2), E \Xij\ 

-^A''''^'' appropriate. By Lemma 2 
J with I J| =2q and for all with ||0| 

r\/m 



< A 



'■/(3K) 

K 



Define m to equal A or 

21 ^-l 



Hj{ct,)^nlj-'^mi{h"{e):\e\<2Qq^ 



below, with probability at least 1 - (|Je'(^^°- ^80?'™'^! 'D for ah 
2 <r. 

Now we show an upper bound and bound the difference. By Lemma [4] below, with probability one under 
(Al) or with probability at least 1 — 2K^^^pn~'^/'^ under (A2), for all J with |J| < 2q and all 4>j,(t>j with 
||0,/||2,||0'j||2 <r, 

HjiM - Hj{4>[j) < Uj - cj)'jh ■ Ci{r) ■ nlj . 

In particular, this implies that 

Hji<j>j) = Hj{<Pj) - Hj{Oj) ^ - 0j||2 • Ci(r) • nlj ^ rCi(r) • nlj . 



Let /3i(r) 



4 



inf {h"{e) :\0\< 20q^ry/m \80q'^ma^^] }, and ^2(r), Psir) := Ci(r). This proves the claim. 



□ 



E.l. Bounding the change in the Hessian when x's are subgaussian. 

Lemma 4. For any radius r > 0, there exists finite C'l = Ci(r) such that for any sample under (Al), or with 
probability at least l — 2K^^^pn^'^^^ under (A2), for all J with \ J\ < 2q, for all (f>j, with \\(t>,j\\2, 110/11 < ^, 

i (HjiM - HjicP'j)) ^ Uj - 0'j||2CiIj . 

Proof. For some convex combination 0j = t(j)j + (1 — i)0j, 

||i?j(0j) - Hj{^'j)\\sp < WHAM - ii.M'j)\\p 

i 

Y^XuXl ■ ■ {Xlicpj - cj^'j)) 

i 

< wx.jxiw^ ■ KixiMi ■ \{xr.{<i>j - m 

i 

< 110,7 -0jll2-Ell^-'llMb"'(^.^'^")| ■ 
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Under assumption (Al), since \Xij\ < A for all 

lli^,7('/'j)-i?j(0'j)llsp<ll'/'J-'/'jll2"- ((29)'-'A3. sup |b"'(^)n ■.^Uj-cl.'jW.n-C,. 

\ \e\<A^-r J 



Now we turn to the setting of assumption (A2). By inequality (86) of Ravikumar et al. (2011), if 
Wi, . . . , Wn are i.i.d. copies of a random variable W with E[|VF|^] < M, then 



E 



K 



and therefore, 



[| H ^» ~ E[iy] 1^] < n"/' iK/2f+^ E \^\W - E[W]\ 

i 

< n"!^ (^/2)^+' • 2^ (e[|W^|^'] + |E[T4^]| 
p| - ^ ly, > 2M'/'^' I < p| I ^ VF, - E \W\ > nM'/'' | 

i i 

= P| I ^ Wi - E[W] ^ > n-f^ilfj 

i 

^[\E.w.-nw]\ 



K 



< 



< 



We apply this result 2p times, to obtain that with probability at least 1 — 2p • K^^'^nr'^^^ , for all j € [p], 
V \X,jf < 2nA%'' , and V sup \h'"{0)\ < 2nQK(r^q)^^'' . 

i \e\<T^q\X,,\ 

Now assume that both of these bounds hold for every j. Then, for each | J| < 

\\Xu\,l < (2g)3max^ \X,/ < {2qf ■ 2nK\^ . 

i i 

Finally, for each |J| < 2g, observe that for each z, < r||Xij||2 < ry^maxj^j and so 



^ |b"'(X,^0';)f < max^ sup |b"'(0)| < 2n^K {r^^2^ 



Therefore, 



i 

<ll0./-'/':7ll2-^E(ll^«.^ll2+|b"'(^.^'/''J)| 



< 



l/K 



□ 



E.2. Positive definite Hessian. We now show that, under mild assumptions, the Hessian of the negative 
log-likclihood will be positive definite with its smallest eigenvalue bounded away from zero. 



\X 



Lemmaj2| Fix J with \ J\ — 2q, and radius R > 0. Assume X^in (E [A"-^ jA'j^'j] ) > ai > andsup^gjE 
TO. If n is sufficiently large, then with probability at least 1 — £^("'^^'^■1'^'''^ 1) for all cf) ^ cf)j with 

m2<r, 



< 



Hj{(j)) >nlj-^- inf {h" [9) : \e\ < 20gV^ \80q^ma^^] } 
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We first give a brief intuition for the proof. We have Hj{(j)) = J^i ^iJ^fj ' Due to the moment 

condition on the covariates, we know that Ylii ^u^Jj will be approximately equal to nE [X^jXj^] > n-a\lj. 
However, this is not sufficient, because for some i, we might have very small values of h"{Xj^(j)). Instead, 
we consider only those i for which h"{Xf^(t)) satisfies some lower bound. By considering the sum of XijXjj 
over this subset of the i's, we will obtain the desired result. 

Proof. From the assumptions, for all j, k G J, 



Var (Xi.Xife) < E iXifcl' 



< ^E 
- 2 



\X,,f + \X,,f 



< m . 



Let N = [SOg^maf^], and let n' = [^\. Then 



N > 



20E,,fc6jVar(Xi,Xifc) 



For each io = N, 2N, 37V, ... , {2n')N, define matrix M^'") e W""-^ as 



and define events 



i=io-{N-V) 



£;('o) = |||M(*°)||,p < \\^^ (E [X^jXlj])^ , 

p{io) ^ s^xl < lOq^N for all j e J,i = io- {N -1),..., io} . 



Define also positive constant 



bo = inf b"(6») 

\e\<20q'^r^N 



Below we show that, for the fixed choice of J, with probability at least 1 - e"(^^°' \^°<i'"^"i '1 ) 

#{io e {7V,2Ar,37V,...,(2n')Af} : £;(^°) n F^*") } > ^ . 
Now suppose that this is true. Take any io such that £'(*o) a,nd F^*''^ both occur. Then, by definition of 

- Yl ^^i^ik = M('«) + E [X^jXjj] = Mf'") + E [Xij] E [X^Jf + Cov{X^J) 

i=ia-{N-l) 



h M(^o) + Cov{X,j) h [Xmin{Cov{X,j)) - \\M^'°^ \Q Ij h lx^UCov{X,j))Ij = "^Ij 



It . 



And, by definition of F^'^°\ for all (j) with Support{4>) = J and \\(l)\\2 < r, 



b"{Xl(P) > bo for alH = io - (iV - 1), . . . , io 
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Therefore, for all (f) with Support{(f)) = J and \\(f)\\2 < r, 



i=l io=N,...,{2n')N i=io-{N-l) 

io 

y y: E x,jxrjh"{xr.^) 



E E ^ij^-' 

io:B<'o),F(<o) i=io-(A''-l) 



>- 



E 



^ n boaiN boai ^ 



It remains to be shown that 

# {io e {TV, 2Ar, 37V, ... , {2n')N} : ^('"^ n F^'") } > ^ • 

We will do this by showing that (for each iq) P {E^^o)j > o.8 and P {E^'o)} > 0.8. 

Fix any io € {1, . . . , (2n')}. First, we treat the event . By the definition of A'', we have 



E 



E ( ]V E ^ij^ik - E [XijJ^ife] I 
J,feeJ \ i=io-(Af-l) / 



= Ev^Mir E ^^.-^^^ 



i=io-(N-l) 



= ^ E Var(XyXi,) 



j,keJ 



Next, wc define matrix M(^'') € M"'''"' as Mj'^°^ = ^ ElLio-(JV-i) ^ij^ife-E [XyXi^]. We have, by Markov's 
inequality, since ||M(*°) ||sp < ||M(*«) 



p|||M(^°)||,p>^A„i„ {E[X,jX^j])^ 
<p|||M(^o)f^>iAL„(E [X.jX^j])^ 



E E Xi,X,k-E[X,,X,k]\ >^Xl,^{E[X,jX^j]) 

\j,keJ \ i=io-{N-l) J 

So, P{i;(''')} > 0.8. 

Next we consider F^'^°\ For all j, by Markov's inequality. 



> < 



p{|Xijf > lOqy/mN^ 



E 



< 



< 



E 



1^1. r 



< 



{lOqN)- 



Then 



10q^/mN - 10q^/mN 
P { = P {3i e {io - (iV - 1), . . . , io},i e J, s.t. X2. > 10g^/^7V} 

- E E'^ {^t/- > lOgVmiV} < 2qN ■ (lOgiV)-^ = 0.2 . 

i=io-{N-l)jeJ 



MODEL SELECTION IN SPARSE GENERALIZED LINEAR MODELS 



37 



Finally, for each io = N, 2N, 3N,..., {2n')N, 

p|£;('o) p|^(io)| > 1 _p|(^i;(»")y| -p|(^F(*«)y| > 0.6 . 

By the Chernoff bound, for sufficiently large n (so that the relative difference between ^ and n' = [^J is 
sufficiently small), 

n 1 . „ r„. " " ^ _ 1 

N'[ 6, 



'|#Oo : E^'"^ n F^'°^} < ^} < Binomial (2n', 0.6) < 0.6 • ^ • ( f 



< P jBinomial {2n' , 0.6) < 0.6(2n') ' (^1 - ^ 

< exp \ -0.6(2n') 



2 j 

< exp{-n- (150iV)-i} . 
So, for a fixed J with |J| = 2g, with probability at least 1 - e-i^^°■\^°1'""'^^^y''' 
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